Abstract

In this paper three problems related to the analysis of facial images are addressed: the estimation of 
the illuminant direction, the compensation of illumination effects and, finally, the recovery of the pose of 
the face, restricted to in-depth rotations. The solutions proposed for these problems rely on the use of 
computer graphics techniques to provide images of faces under different illumination and pose, starting 
from a database of frontal views under frontal illumination. 
1 Introduction

Automated face perception (localization, recognition
and coding) is now a very active research topic in the
computer vision community. Among the reasons, the
possibility of building applications on top of existing
research is probably one of the most important. While
recent results on localization and recognition open the
way to automated security systems based on face
identification, breakthroughs in the field of facial image coding
are of practical interest for teleconferencing and database
applications.

In this paper three tasks will be addressed: learning
the direction of illuminant for frontal views of faces,
compensating for non-frontal illumination and, finally,
estimating the pose of a face, limited to in-depth
rotations. The solutions we propose to these tasks share two
important aspects: the use of learning techniques and
the synthesis of the examples used in the learning stage.
Learning an input/output mapping from examples is a
powerful general mechanism of problem solving once a
suitably large number of meaningful examples is
available. Unfortunately, gathering the needed examples is
often a time-consuming, expensive process. Yet, the use
of a priori knowledge can help in creating new, valid
examples from a (possibly limited) available set. In this
paper we use a rough model of the 3D head structure
to generate, from a single frontal view of a face under
uniform illumination, a set of views under different poses
and illumination using ray-tracing and texture mapping
techniques. The resulting extended sets of examples will
be used for solving the addressed problems using
learning techniques.


2 Learning the illuminant direction 

In this section the computation of the direction of the
illuminant is considered as a learning task (see [1, 2, 3, 4]
for other approaches). The images for which the
direction must be computed are very constrained: they are
frontal views of faces with a fixed interocular distance [5].
Once the illuminant direction is known it can be
compensated for, obtaining an image under standard
illumination which can be more easily compared to a database of
faces using standard techniques such as cross-correlation.
Let us introduce a very simple lighting model [6]:

$$I = (A + L\,\delta\cos\omega)\,\mathcal{A} \qquad (1)$$

where $I$ represents the emitted intensity, $A$ is the
ambient energy, $\delta$ is 1 if the point is visible from the light
source and 0 otherwise, $\omega$ is the angle between the
incident light and the surface normal, $\mathcal{A}$ is the surface albedo
and $L$ is the intensity of the directional light. Let us
assume that a frontal image $I_A$ of a face under diffuse
ambient lighting ($L = \delta = 0$) is available:

$$I_A = A\,\mathcal{A} \qquad (2)$$


The detected intensity is then proportional to the
surface albedo. Let us now assume that a 3D model of the
same face is available. The corresponding surface can be
easily rendered using ray-tracing techniques if the light
sources and the surface albedo are given. In particular,
we can consider a constant surface albedo $\mathcal{A}_0$ and use a
single, directional, light source of intensity $L$ in addition
to an appropriate level of ambient light $A'$. By changing
the direction $\Omega = (\theta, \phi)$ (footnote 1) of the emitted light, the
corresponding synthetic image $S(\theta, \phi, A')$ can be computed:

$$S(\theta, \phi, A') = (A' + L\,\delta\cos\omega)\,\mathcal{A}_0 \qquad (3)$$

Using the albedo information $I_A$ from the real image, a
set of images $I(\theta, \phi, A')$ can be computed:

$$I(\theta, \phi, A') = \frac{1}{\mathcal{A}_0 A}\,S(\theta, \phi, A')\,I_A \propto S(\theta, \phi, A')\,I_A \qquad (4)$$
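As an illustration of eqns. (1)-(4), the following minimal sketch modulates a diffusely lit reference image by a constant-albedo rendering. The function names, the two-pixel image and all numeric values are hypothetical and chosen only to make the arithmetic easy to check; they are not taken from the experiments.

```python
import math

def shade(ambient, light, visible, omega_deg, albedo):
    """Lighting model of eqns. (1) and (3): (ambient + L * delta * cos(omega)) * albedo."""
    delta = 1.0 if visible else 0.0
    return (ambient + light * delta * math.cos(math.radians(omega_deg))) * albedo

def relight(synthetic, reference, albedo0, ambient):
    """Eqn. (4): I = S * I_A / (albedo0 * ambient), applied pixel by pixel."""
    return [[s * r / (albedo0 * ambient) for s, r in zip(s_row, r_row)]
            for s_row, r_row in zip(synthetic, reference)]

# Hypothetical 1x2 image: two pixels with different (unknown to the renderer) albedo.
A, A_prime, L, albedo0 = 0.5, 0.2, 1.0, 0.8
true_albedo = [[0.3, 0.6]]
omega = [[0.0, 60.0]]  # incidence angle at each pixel, in degrees

reference = [[A * a for a in row] for row in true_albedo]            # eqn. (2)
synthetic = [[shade(A_prime, L, True, w, albedo0) for w in row]      # eqn. (3)
             for row in omega]
relit = relight(synthetic, reference, albedo0, A)                    # eqn. (4)
```

In this toy case `relit` coincides with the image that eqn. (1) would give for the true albedo under the directional light, which is the property the synthetic examples rely on.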

In the following paragraphs it will be shown that even a 
very crude 3D model of a head can be used to generate 
images for training a network that learns the direction of 
the illuminant. The effectiveness of the training will be 
demonstrated by testing the trained network on a set of 
real images of a different face. The resulting estimates 
are in good quantitative agreement with the data. 

From a rather general point of view the problem of
learning can be considered as a problem of function
reconstruction from sparse data [7]. The points at which
the function value is known represent the examples while
the function to be reconstructed is the input/output
dependence to be learned. If no additional constraints are
imposed, the problem is ill posed. The single most
important constraint is that of smoothness: similar inputs
should be mapped into similar outputs. Regularization
theory formalizes the concept and provides techniques to
select an appropriate family of mappings among which an
approximation to the unknown function can be chosen.
Let us consider the reconstruction of a scalar function
$y = f(x)$: the vector case can be solved by considering
each component in turn. Given a parametric family of
mappings $G(x; \alpha)$ and a set of examples $\{(x_i, y_i)\}$, the
function which minimizes the following functional is
chosen:

$$E(\alpha) = \sum_i \left(y_i - G(x_i; \alpha) - p(x_i)\right)^2 \qquad (5)$$

where $y_i = f(x_i)$ and $p(x_i)$ represents a polynomial term
related to the regularization constraints. A common
choice for the family $G$ is that of linear superposition
of translates of a single function such as the Gaussian:

$$G(x; \{c_j\}, \{t_j\}, W) = \sum_j c_j\, e^{-(x - t_j)^T W^T W (x - t_j)} \qquad (6)$$
where $W^T W$ is a positive definite matrix representing
a metric (the polynomial term is not required in this
case). The resulting approximation structure can also
be considered as a HyperBF network (see [7] for further
details). In the task of learning the illuminant direction
we would like to associate the direction of the light source
to a vector of measurements derived from a frontal image
of a face.
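A minimal sketch of how a Gaussian network of this kind can be evaluated, together with the squared error of eqn. (5) without the polynomial term, is given below. The function names and the numeric values are illustrative assumptions; the training procedure itself is not shown.

```python
import math

def hyperbf(x, coeffs, centers, W):
    """G(x) = sum_j c_j * exp(-(x - t_j)^T W^T W (x - t_j))."""
    total = 0.0
    for c, t in zip(coeffs, centers):
        d = [xi - ti for xi, ti in zip(x, t)]
        Wd = [sum(w_rk * d_k for w_rk, d_k in zip(row, d)) for row in W]
        total += c * math.exp(-sum(v * v for v in Wd))
    return total

def squared_error(examples, coeffs, centers, W):
    """Sum over the examples (x_i, y_i) of (y_i - G(x_i))^2."""
    return sum((y - hyperbf(x, coeffs, centers, W)) ** 2 for x, y in examples)

# Hypothetical two-center network on a 2D input with a diagonal metric.
coeffs = [2.0, 3.0]
centers = [[0.0, 0.0], [10.0, 10.0]]
W = [[1.0, 0.0],
     [0.0, 1.0]]
value_at_first_center = hyperbf([0.0, 0.0], coeffs, centers, W)
```

With a diagonal `W`, each input coordinate is simply rescaled before the distance from a center is computed, which is the configuration reported as adequate for the illuminant task.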

In order to describe the intensity distribution over the
face, the central region (see Figure 1) was divided into
four patches, each one represented by an average
intensity value computed using Gaussian weights (see Figure
2). The domain of the examples is then $\mathbb{R}^4$. Each input
vector $x$ is normalized to length 1, making the input
vectors independent of the scaling of the image intensities, so
that the set of images $\{I(\theta, \phi, A')\}_{\theta\phi}$ can be replaced by
$\{S(\theta, \phi, A')\,I_A\}_{\theta\phi}$.

Footnote 1: The angles $\theta$ and $\phi$ correspond to left-right and top-down
displacement respectively.

The normalization is necessary as it is not the global
light intensity which carries information on the direction
of the light source, but rather its spatial distribution.
Using a single image under (approximately) diffuse lighting,
a set of synthetic images was computed using eqn. (3).
A rough 3D model of a polystyrene mannequin head was
used to generate the constant albedo images (footnote 2). The
direction of the illuminant spanned the range $\theta, \phi \in [-60, 60]$
degrees with the examples uniformly spaced every 5 degrees. The
illumination source used for ray tracing was modeled to
match the environment in which the test images were
acquired (low ambient light and a powerful studio light
with diffuser).
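The construction of the normalized four-dimensional input can be sketched as follows. The patch centers, the Gaussian width and the toy image are hypothetical placeholders; the actual receptive fields are those shown in Figures 1 and 2.

```python
import math

def gaussian_patch_mean(image, cy, cx, sigma):
    """Gaussian-weighted average of the intensity around (cy, cx)."""
    num = den = 0.0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            w = math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2))
            num += w * value
            den += w
    return num / den

def illumination_features(image, centers, sigma):
    """One weighted average per patch, rescaled to unit Euclidean length."""
    raw = [gaussian_patch_mean(image, cy, cx, sigma) for cy, cx in centers]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

# Hypothetical 4x4 image that gets brighter from left to right.
image = [[10.0, 20.0, 30.0, 40.0] for _ in range(4)]
centers = [(0.5, 0.5), (0.5, 2.5), (2.5, 0.5), (2.5, 2.5)]  # assumed patch centers
features = illumination_features(image, centers, 1.0)
```

Multiplying the image by any positive constant leaves `features` unchanged, which is why only the spatial distribution of the light, and not its global intensity, reaches the network.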

From each image $S(\theta, \phi, A')\,I_A$ an example $(x_{\theta\phi}, \theta)$ is
computed. The resulting set of examples is divided into
two subsets to be used for training and testing
respectively. The use of two independent subsets is important
as it allows checking for the phenomenon of overfitting,
usually related to the use of a network which has too
many free parameters for the available set of examples.
Experimentation with several network structures showed
that a HyperBF network with 4 centers and a diagonal
metric is appropriate for this task. A second network
is built using the examples $\{(x_{\theta\phi}, \phi)\}_{\theta\phi}$. The networks
are trained separately using a stochastic algorithm with
adaptive memory [8] for the minimization of the global
square error of the corresponding outputs:

$$E_\theta(\alpha) = \sum_{\theta\phi} \left(\theta - G_\theta(x_{\theta\phi}; \alpha)\right)^2$$

$$E_\phi(\alpha) = \sum_{\theta\phi} \left(\phi - G_\phi(x_{\theta\phi}; \alpha)\right)^2$$

The error $E_\theta$ for the different values of the illuminant
direction is reported in Figure 4. The network trained
on the $\theta$ angle is then tested on a set of four real images
for which the direction of the illuminant is known (see
Figure 5). The response of the network is reported in
Figure 6 and is in good agreement with the true values.

Once the direction of the illuminant is known, the 
synthetic images can be used to correct for it, providing 
an image under standard (e.g. frontal) illumination. The 
next section details a possible strategy. 

3 Illumination compensation 

Once the direction of the illuminant is computed, the
image can be corrected for it and transformed into an
image under standard illumination, e.g. frontal. The
compensation can proceed along the following steps:

1. compute the direction $(\theta, \phi)$ of the illuminant;

2. establish a pixel to pixel correspondence $M$ between
the image to be corrected $X$ and the reference
image $I_A$ used to create the examples:

$$(x_X, y_X) \xrightarrow{M} (x_{I_A}, y_{I_A}) \qquad (7)$$

3. generate a view $I(\theta, \phi, A)$ of the reference image
under the computed illumination;

4. compute the transformation due to the change in
illumination between $I_A$ and $I(\theta, \phi, A)$:

$$\Delta(x, y) = I_A(x, y) - I(\theta, \phi, A; x, y) \qquad (8)$$

5. apply the transformation $\Delta$ to image $X$ by using
the correspondence map $M$ in the following way
[9]:

$$X(x, y) \rightarrow X(x, y) + \Delta(M_x(x, y), M_y(x, y)) \qquad (9)$$

Footnote 2: A public domain rendering package, Rayshade 4.0 by
Craig Kolb, was used.

The pixel to pixel correspondence $M$ can be computed
using optical flow algorithms [10, 11, 12, 13]. However,
in order to use such algorithms effectively, it is often
necessary to pre-adjust the geometry of image $X$ to that of
$I_A$ [14]. This can be done by locating relevant features
of the face, such as the nose and mouth, and warping
image $X$ so that the location of these features is the
same as in the reference image. The algorithm for
locating the warping features should not be sensitive to
the illumination under which the images are taken and
should be able to locate the features without knowing
the identity of the represented person. The usual way
to locate a pattern within an image is to search for the
maximum of the normalized cross-correlation coefficient
$\rho_{xy}$ [15]. The sensitivity of this coefficient to changes in
the illumination can be reduced by a suitable processing
of the images prior to the comparison (see Appendix A).
Furthermore, the identity of the person in the image is
usually unknown so that the features should be located
using generic templates (a possible strategy is reported
in Appendix B). After locating the nose and mouth the
whole face is divided into four rectangles with sides
parallel to the image boundary: from the eyes upwards,
from the eyes to the nose base, from the nose base to
the mouth and from the mouth downwards. The two
inner rectangles are stretched (or shrunk) vertically so that
the nose and mouth are aligned to the corresponding
features of the reference image $I_A$. The lowest rectangle is
then modified accordingly. The image contents are then
mapped using the affine transformations of the rectangles and
a hierarchical optical flow algorithm is used to build a
correspondence map at the pixel level. The
transformations are finally composed to compute the map $M$ and
image $X$ can be corrected according to eqns. (8-9). One
of the examples previously used is reported in Figure 8
under the original illumination and under the standard
one obtained with the described procedure.
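The correction of eqns. (8-9) can be sketched in a few lines once the correspondence map is available. In the sketch below the map is passed as a hypothetical function `mapping(x, y)` returning integer reference coordinates, and the tiny images are invented for illustration; computing the map itself (feature alignment followed by optical flow) is outside the scope of the example.

```python
def illumination_delta(reference, relit):
    """Eqn. (8): Delta = I_A - I(theta, phi), pixel by pixel."""
    return [[a - b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(reference, relit)]

def compensate(image, delta, mapping):
    """Eqn. (9): X(x, y) <- X(x, y) + Delta(M_x(x, y), M_y(x, y))."""
    corrected = []
    for y, row in enumerate(image):
        new_row = []
        for x, value in enumerate(row):
            mx, my = mapping(x, y)  # coordinates in the reference image
            new_row.append(value + delta[my][mx])
        corrected.append(new_row)
    return corrected

def identity(x, y):
    """Trivial correspondence: the two images are already aligned."""
    return x, y

# Hypothetical 2x2 images: reference under diffuse light and its relit version.
reference = [[10, 12], [14, 16]]
relit = [[7, 12], [10, 20]]
delta = illumination_delta(reference, relit)
corrected = compensate(relit, delta, identity)
```

When the image to be corrected is the relit reference itself and the map is the identity, the procedure returns the reference image exactly; for a different face the result is only an approximation whose quality depends on the map $M$.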

4 Pose Estimation 

In this section we present an algorithm for estimating
the pose of a face, limited to in-depth rotations. The
knowledge of the pose can be of interest both for
recognition systems, where an appropriate template can then
be chosen to speed up the recognition process [14], and
for model based coding systems, such as those which
could be used for teleconferencing applications [16]. The
idea underlying the proposed algorithm for pose
estimation is that of quantifying the asymmetry between the
aspect of the two eyes due to in-depth rotation and
mapping the resulting value to the amount of rotation. It is
possible to visually estimate the in-depth rotation even
when the eyes are represented schematically, such as in
some cartoon characters where eyes are represented by
small bars. This suggests that the relative amount of
gradient intensity, along the mouth-forehead direction,
in the regions corresponding to the left and right eye
respectively provides enough information for estimating
the in-depth rotation parameter.

The algorithm requires that the location of one of the
eyes is approximately known as well as the direction of
the interocular axis. Template matching techniques such
as those outlined in Appendix B can be used to locate
one of the eyes even under large left-right rotations and
the direction of the interocular axis can be computed
using the method reported in [17]. Let us assume for
simplicity of notation that the interocular axis is horizontal.
Using the projection techniques reported in [18, 5] we
can approximately localize the region where both eyes are
confined. For each pixel in the region the following map
is computed:


$$V(x, y) = \begin{cases} |\partial_y C(x, y)| & \text{if } |\partial_y C(x, y)| > |\partial_x C(x, y)| \\ 0 & \text{otherwise} \end{cases} \qquad (10)$$

where $C(x, y)$ represents the local contrast map of the
image computed according to eqn. (13) (see Appendix
A). The resulting map assigns a positive value to
pixels where the projection of the gradient along the
mouth-forehead direction dominates over the projection along
the interocular axis. In order to estimate the asymmetry
of the two eyes it is necessary to determine the regions
corresponding to the left and right eye respectively. This
can be done by computing the projection $P(x)$ of $V(x, y)$
on the horizontal axis, given by the sum of the values in
each of the columns. The analysis of the projections is
simplified if they are smoothed: in our experiments a
Gaussian smoother was used. The resulting projections
at different rotations are reported in Figure 9. These
data are obtained by rotating the same 3D model used
for the generation of the illumination examples: texture
mapping techniques are then used to project a frontal
view of a face onto the rotated head (see [13, 19] for
alternative approaches to the estimation of pose and
synthesis of non-frontal views).
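A minimal sketch of the map of eqn. (10), of its column projection and of the smoothing step follows. The use of central differences, the zero-valued border and the border replication in the smoother are assumptions of the sketch, as is the toy contrast map containing a single horizontal bar.

```python
import math

def vertical_gradient_map(contrast):
    """Eqn. (10) with central differences; border pixels are left at 0."""
    h, w = len(contrast), len(contrast[0])
    V = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            dy = abs(contrast[y + 1][x] - contrast[y - 1][x]) / 2.0
            dx = abs(contrast[y][x + 1] - contrast[y][x - 1]) / 2.0
            if dy > dx:
                V[y][x] = dy
    return V

def column_projection(V):
    """P(x): sum of the values of V in each column."""
    return [sum(column) for column in zip(*V)]

def gaussian_smooth(values, sigma):
    """Smooth a 1D profile with a normalized Gaussian, replicating the borders."""
    radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2.0 * sigma ** 2)) for k in offsets]
    total = sum(kernel)
    n = len(values)
    return [sum(w * values[min(max(i + k, 0), n - 1)]
                for k, w in zip(offsets, kernel)) / total
            for i in range(n)]

# Hypothetical contrast map: a horizontal bar, like a schematic eye.
contrast = [[0.0] * 7 for _ in range(5)]
contrast[2][1:6] = [1.0] * 5
P = column_projection(vertical_gradient_map(contrast))
P_smooth = gaussian_smooth(P, 1.0)
```

For the bar, only the rows just above and below it contribute to `P`, since there the vertical derivative dominates the horizontal one.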

The figure clearly shows that the asymmetry of the
two peaks increases with the amount of rotation. The
asymmetry can be quantified by the following quantity $U$:

$$U = \frac{\sum_{x < x_m} P(x) - \sum_{x > x_m} P(x)}{\sum_{x < x_m} P(x) + \sum_{x > x_m} P(x)} \qquad (11)$$

where $x_m$ is the coordinate of the minimum between the
two peaks. The value of $U$ as a function of the angle of
rotation is reported in Figure 10. Using the approximately
linear relation it is possible to quantify the rotation of a
new image. The pose recovered by the described
algorithm from several images is reported in Figure 11 where
for each of the test images a synthetic image with the
corresponding pose is shown.
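The computation of eqn. (11) can be sketched as below. The peak detection rule (the two largest local maxima of the smoothed projection) is an assumption made for the example and requires at least two such maxima; the profiles are invented.

```python
def valley_between_peaks(P):
    """Index x_m of the minimum of P between its two largest local maxima."""
    peaks = [i for i in range(1, len(P) - 1) if P[i - 1] < P[i] >= P[i + 1]]
    a, b = sorted(sorted(peaks, key=lambda i: P[i])[-2:])
    return min(range(a, b + 1), key=lambda i: P[i])

def asymmetry(P, x_m):
    """Eqn. (11): normalized difference of the projection mass on the two sides of x_m."""
    left = sum(P[:x_m])
    right = sum(P[x_m + 1:])
    return (left - right) / (left + right)

# Hypothetical smoothed projections: one unbalanced, one perfectly symmetric.
rotated = [0, 2, 4, 2, 1, 2, 3, 2, 0]
frontal = [0, 1, 3, 1, 0, 1, 3, 1, 0]
U_rotated = asymmetry(rotated, valley_between_peaks(rotated))
U_frontal = asymmetry(frontal, valley_between_peaks(frontal))
```

For these profiles `U_rotated` is 1/15 and `U_frontal` is 0. Mapping $U$ to an angle then only requires the slope of the approximately linear relation of Figure 10, which has to be estimated from synthetic rotations and is not reproduced here.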

5 Conclusions 

In this paper three problems related to the analysis of 
facial images have been addressed: the estimation of the 
illuminant direction, the compensation of illumination 
effects and, finally, the recovery of the pose of the face, 
restricted to left-right rotations. The solutions proposed 
for these problems rely on the use of computer graphics 
techniques to provide images of faces under different
illumination and pose starting from a database of frontal
views under frontal illumination. The algorithms trained 
using synthetic images have been successfully applied to 
real images. 
A Illumination sensitivity

A common measure of the similarity of visual patterns,
represented as vectors or arrays of numbers, is the
normalized cross-correlation coefficient:

$$\rho_{xy} = \frac{\mu_{xy}}{\sqrt{\mu_{xx}\,\mu_{yy}}} \qquad (12)$$

where $\mu_{xy}$ represents the second order, centered moments.
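Eqn. (12) can be evaluated directly from the centered moments, as in the following minimal sketch (the function names and the sample vectors are illustrative only):

```python
import math

def central_moment(x, y):
    """Second order, centered moment mu_xy of two equally long vectors."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)

def correlation(x, y):
    """Normalized cross-correlation coefficient of eqn. (12)."""
    return central_moment(x, y) / math.sqrt(central_moment(x, x) * central_moment(y, y))

pattern = [1.0, 2.0, 4.0, 7.0]
brighter = [2.0 * v + 3.0 for v in pattern]   # gain and offset change
inverted = [-v for v in pattern]
rho_same = correlation(pattern, brighter)
rho_inverted = correlation(pattern, inverted)
```

A change of gain and offset leaves the coefficient at 1, while a contrast inversion gives -1.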

The value of $|\rho_{xy}|$ is equal to 1 if the components of
two vectors are the same modulo a linear
transformation. While the invariance to linear transformation of
the patterns is clearly a desirable property (automatic
gain and black level adjustment of many cameras
involve such a linear transformation) it is not enough to
cope with the more general transformations implied by
changes of the illumination sources. A common approach
to the solution of this problem is to process the visual
patterns before the estimation of similarity is done, in
order to preserve the necessary information and eliminate
the unwanted details. A common preprocessing
operation is that of computing the intensity of the brightness
gradient and using the resulting map for the comparison
of the patterns. Another preprocessing operation is that
of computing the local contrast of the image. A possible
definition is the following:


$$C = \begin{cases} C' & \text{if } C' \le 1 \\ 2 - \frac{1}{C'} & \text{if } C' > 1 \end{cases} \qquad (13)$$

where

$$C' = \frac{I}{I * K_{G(\sigma)}} \qquad (14)$$

and $K_{G(\sigma)}$ is a Gaussian kernel whose $\sigma$ is related to the
expected interocular distance. It is important to note
that $C$ saturates in regions of high and low local contrast
and is consequently less sensitive to noise.
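A minimal sketch of the local contrast operator of eqns. (13)-(14) is given below. It assumes strictly positive intensities, a separable Gaussian truncated at three standard deviations and replicated borders; these choices, like the toy image, are assumptions of the example.

```python
import math

def clamp(i, n):
    """Replicate border pixels by clamping the index to [0, n - 1]."""
    return min(max(i, 0), n - 1)

def gaussian_blur(image, sigma):
    """Separable convolution with a normalized Gaussian kernel."""
    radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-(k * k) / (2.0 * sigma ** 2)) for k in offsets]
    total = sum(kernel)
    kernel = [k / total for k in kernel]
    h, w = len(image), len(image[0])
    rows = [[sum(kv * row[clamp(x + k, w)] for k, kv in zip(offsets, kernel))
             for x in range(w)] for row in image]
    return [[sum(kv * rows[clamp(y + k, h)][x] for k, kv in zip(offsets, kernel))
             for x in range(w)] for y in range(h)]

def local_contrast(image, sigma):
    """C' = I / (I * K) as in eqn. (14), then the saturation of eqn. (13)."""
    blurred = gaussian_blur(image, sigma)
    result = []
    for image_row, blurred_row in zip(image, blurred):
        row = []
        for intensity, local_mean in zip(image_row, blurred_row):
            c = intensity / local_mean
            row.append(c if c <= 1.0 else 2.0 - 1.0 / c)
        result.append(row)
    return result

# Hypothetical 6x6 image with a bright spot on a dark background.
image = [[10.0] * 6 for _ in range(6)]
image[2][3] = 200.0
contrast = local_contrast(image, 1.0)
```

The output always lies in the interval [0, 2) and does not change when the whole image is multiplied by a positive constant.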

Recently some claims have been made that the
gradient direction field has good properties of invariance to
changes in the illumination [20]. In the case of the
direction field, where a vector is associated to each single
pixel of the image, the similarity can be computed by
measuring the alignment of the gradient vectors at each
pixel. Let $g_1(x, y)$ and $g_2(x, y)$ be the gradient fields of
the two images and $\|\cdot\|$ represent the usual vector norm.
The global alignment can be defined by


$$A = \frac{1}{\sum_{(x,y)} w(x, y)} \sum_{\|g_1(x,y)\|,\,\|g_2(x,y)\| > 0} w(x, y)\, \frac{g_1(x, y) \cdot g_2(x, y)}{\|g_1(x, y)\|\,\|g_2(x, y)\|} \qquad (15)$$

where

$$w(x, y) = \frac{1}{2}\left(\|g_1(x, y)\| + \|g_2(x, y)\|\right) \qquad (16)$$
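The alignment of eqns. (15)-(16) can be sketched as follows, with each gradient field passed as a flat list of `(gx, gy)` pairs at corresponding pixels. The representation and the sample fields are assumptions of the example; at least one gradient must be non-zero for the normalization to be defined.

```python
import math

def alignment(g1, g2):
    """Weighted mean cosine between corresponding gradient vectors, eqns. (15)-(16)."""
    numerator = denominator = 0.0
    for (ax, ay), (bx, by) in zip(g1, g2):
        n1, n2 = math.hypot(ax, ay), math.hypot(bx, by)
        weight = 0.5 * (n1 + n2)          # eqn. (16)
        denominator += weight             # the normalization runs over all pixels
        if n1 > 0 and n2 > 0:             # the cosine is defined only for non-zero gradients
            numerator += weight * (ax * bx + ay * by) / (n1 * n2)
    return numerator / denominator

field = [(1.0, 0.0), (0.0, 2.0)]
opposite = [(-gx, -gy) for gx, gy in field]
a_same = alignment(field, field)
a_opposite = alignment(field, opposite)
a_orthogonal = alignment([(1.0, 0.0)], [(0.0, 1.0)])
```

Identical fields give 1, opposite fields give -1 and orthogonal gradients give 0; pixels with a strong gradient weigh more than pixels with a weak one.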


The formula is very similar to the one used in [20] (a
normalization factor has been added). The following
preprocessing operators were compared using either the
normalized cross-correlation coefficient $\rho$ or the alignment
$A$:

plain: the original brightness image convolved with a
Gaussian kernel of width $\sigma$;

contrast: each pixel is represented by the local image
contrast as given by eqn. (13);

gradient: each pixel is represented by the brightness
gradient intensity computed after convolving the
image with a Gaussian kernel of standard deviation
$\sigma$:

$$\|\nabla(N_\sigma * I(x, y))\| \qquad (17)$$

gradient direction: each pixel is represented by the
brightness gradient of $N_\sigma * I(x, y)$. The similarity
is estimated through the coefficient $A$ of eqn. (15);

laplacian: each pixel is represented by the value of the
laplacian operator applied to the intensity image
convolved with a Gaussian kernel.

For each of the preprocessing operators, the similarity of
the original image under (nearly) diffuse illumination to
the synthetic images obtained through eqn. (4) was
computed. The corresponding average values are reported
in Figure 12 for different values of the parameter $\sigma$ of
the preprocessing operators. The local contrast operator
turns out to be the least sensitive to variations in the
illuminant direction. It is also worth mentioning that the
minimal sensitivity is achieved for an intermediate value
of $\sigma$: this should be compared to the monotonic
behavior of the other operators. Further experiments with
the template-based face recognition system described in
[5] have practically demonstrated the advantage of using
the local contrast images for the face recognition task.


B Alternative Template Matching 

The correlation coefficient is quite sensitive to noise and
alternative estimators of pattern similarity may be
preferred. Such measures can be derived from distances
other than the Euclidean, such as the $L_1$ norm defined
by:

$$d_1(x, y) = \sum_{i=1}^{n} |x_i - y_i| \qquad (18)$$

where $n$ is the dimension of the considered vectors. A
similarity measure based on the $L_1$ norm can be
introduced:

$$l(x', y') = 1 - \frac{1}{n} \sum_{i=1}^{n} \frac{|x'_i - y'_i|}{|x'_i| + |y'_i|} \qquad (19)$$

that satisfies the following relations:

$$l(x', y') \in [0, 1]$$

$$l(x', y') = 1 \Leftrightarrow x' = y'$$

$$l(x', y') = 0 \Leftrightarrow x' = -y'$$

where $x'$ and $y'$ are normalized to have zero average and
unit variance. The characteristics of this similarity
measure are extensively discussed in [21] where it is shown
that it is less sensitive to noise than $\rho_{xy}$ and technically
robust [22]. Hierarchical approaches to the computation
of correlation, such as those proposed in [23], are readily
extended to the use of this alternative coefficient.
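The coefficient of eqn. (19) can be sketched as follows. The treatment of terms where both normalized components vanish (counted as a perfect match) is an assumption of the example, as are the function names and sample vectors.

```python
import math

def normalize(v):
    """Rescale a vector to zero average and unit variance."""
    mean = sum(v) / len(v)
    std = math.sqrt(sum((a - mean) ** 2 for a in v) / len(v))
    return [(a - mean) / std for a in v]

def l1_similarity(x, y):
    """Eqn. (19), computed on the normalized vectors x' and y'."""
    xn, yn = normalize(x), normalize(y)
    total = 0.0
    for a, b in zip(xn, yn):
        denominator = abs(a) + abs(b)
        if denominator > 0:   # assumption: a 0/0 term contributes no dissimilarity
            total += abs(a - b) / denominator
    return 1.0 - total / len(xn)

pattern = [1.0, 2.0, 4.0, 7.0]
l_same = l1_similarity(pattern, pattern)
l_inverted = l1_similarity(pattern, [-v for v in pattern])
l_rescaled = l1_similarity(pattern, [2.0 * v + 3.0 for v in pattern])
```

The three values illustrate the relations stated above: 1 for identical patterns, 0 for sign-inverted ones, and 1 (up to rounding) for a pattern that differs only by gain and offset, since the normalization removes both.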

The influence of template shape can be further
reduced by slightly modifying $l(x, y)$. Let us assume that
the template $T$ and the corresponding image patch are
normalized to zero average and unit variance. We
denote by $N_I(x)$ the 4-connected neighborhood of point
$x$ in image $I$ and by $F_{N_I(x)}(w)$ the intensity value in $N_I(x)$
whose absolute difference from $w$ is minimum: if two
values qualify, their average ($w$) is returned. A modified
$l(x, y)$ can then be introduced:

$$l'(y) = \frac{1}{n} \sum_x \left(1 - \frac{|F_{N_I(x+y)}(T(x)) - T(x)|}{|F_{N_I(x+y)}(T(x))| + |T(x)|}\right)$$

The new coefficient introduces the possibility of local
deformation in the computation of similarity (see also [24]
for an alternative approach).
Figure 1: The facial region used to estimate the direction of illuminant. Four intensity values are derived by
computing a weighted average, with Gaussian weights, of the intensity over the left (right) cheek and left (right) 
forehead-eye regions. 



Figure 2: Superimposed Gaussian receptive fields giving the four dimensional input of the HyperBF network. Each 
field computes a weighted average of the intensity. The coordinates of the plot represent the image plane coordinates
of Figure 1. 



Figure 3: Computer generated images (left) are used to modulate the intensity of a single view under approximately 
diffuse illumination (center) to produce images illuminated from different angles (right). The central images are 
obtained by replication of a single view. The right images are obtained by multiplication of the central and left 
images (see text for a more detailed description). 








Figure 4: Error made by a 4-unit HyperBF network in estimating the illuminant direction on the 169 images of the
training set. The horizontal axis represents the left-right position of the illuminant while the vertical axis represents
its height. The intensity of the squares is proportional to the squared error: the lighter the square, the greater the
error is.



Figure 5: Some real images on which the algorithm trained on the synthetic examples has been applied. 


[Plot: computed vs. real angle; horizontal axis: real angle (degrees)]


Figure 6: Illuminant direction as estimated by the HyperBF network compared to the real data for the four test 
images. 
Figure 7: The first three images represent respectively the original image, the image obtained by fixing the nose and
mouth position to that of the reference image (the last in the row) and the refined warped image obtained using a 
hierarchical optical flow algorithm. 



Figure 8: The original image (left) and the image corrected using the procedure described in the text. 


[Plot: gradient projection vs. head rotation]


Figure 9: The drawing reports the dependence of the gradient projection on the degrees of rotation around the 
vertical image axis. The projections are smoothed using a Gaussian kernel of $\sigma = 5$. Note the increasing asymmetry
of the two peaks. 
[Plot: rotation vs. asymmetry]



Figure 10: The drawing reports the dependence of the asymmetry of the projection peaks on the degrees of rotation 
around the vertical image axis. The values are computed by averaging the data from three different people. 



Figure 11: The top row reports the test images while the bottom row shows the images generated using a simple 3D 
model and the rotation estimated using the approximately linear dependence of the gradient projection asymmetry 
on the rotation around the vertical image axis 
Figure 12: Sensitivity to illumination of some common preprocessing operators. The abscissas represent the values
of $\sigma$ (see text for an explanation).